User incentives for blockchain-based data sharing platforms

Data sharing is essential for accelerating scientific research, fostering business innovation, and informing individuals. Yet, concerns over data privacy, cost, and the lack of secure data-sharing solutions have prevented data owners from sharing data. To overcome these issues, several research works have proposed blockchain-based data-sharing solutions for their ability to add transparency and control to the data-sharing process. Yet, while models for decentralized data sharing exist, how to incentivize these structures to enable data sharing at scale remains largely unexplored. In this paper, we study different incentive mechanisms for decentralized data-sharing platforms. Smart contracts are used to automate different payment options between data owners and data requesters. We evaluate multiple cost pricing scenarios for data monetization by simulating incentive mechanisms on a blockchain-based data-sharing platform. We show that a cost compensation model for the data owner rapidly covers the cost of data sharing and balances the overall incentives for all the actors in the platform.


Introduction
Today, despite large amounts of data being generated every second, data remains siloed in the databases of hospitals, companies, and research institutions around the globe. Data sharing is known to accelerate scientific research, improve business innovation, and inform decision-making [1][2][3][4][5]. Yet, several factors contribute to the lack of data sharing in practice [1,3], including legislation, institutional concerns, task complexity, use and participation, information quality, and technical concerns [6]. Stringent data protection laws impede the procurement of large amounts of data. Regulations such as the General Data Protection Regulation (GDPR) [7], Federal Trade Commission (FTC) Act [8], California Consumer Privacy Act 2018 (CCPA) [9], UK Data Protection Act 2018 [10], and Australia Privacy Act 1988 [11] define several mechanisms of data protection. Among these, the GDPR is explicit on the rights of individuals to request information on the data collected, the purpose of use, and with whom the data is shared, as well as to request rectification or deletion of their data.
At a minimum, sharing personal records (e.g. patient data) requires involving individuals in the data-sharing process. This process includes the expression of informed consent, which is normally collected via consent forms specifically tailored for a given study. Secondary use of data is possible only when the original consent and the secondary use are compatible. This process is highly inefficient, often infeasible, and contributes to the lack of secondary data re-use [12]. The typical solution to these issues has been third-party involvement. Users consent to share their data with third parties by agreeing to license agreements, without necessarily being aware of the associated risks or consequences. With this approach, data is collected and curated by different organizations or companies, often to be sold to those who can afford to pay the price [13]. Data collections then become a source of large profits for data processors; however, the individuals whose data are being processed are rarely involved or compensated in this process. Decentralized data-sharing networks (such as blockchains) overcome many data-sharing issues by adding a transparency layer for all data transactions and by directly enabling participants to control their own records [14]. Current decentralization techniques, however, are not costless. Most open blockchain platforms are based on payments per transaction. A transparency layer and individual data control therefore come with costs, but also with opportunities for creating appropriate incentives for a data market to emerge.
Previous research [15][16][17] has established several key benefits of sharing data. These include the verification of previous research, new interpretations, improved data integrity, resource optimization, protection against falsification, and facilitation of researcher training. Several state-of-the-art blockchain data-sharing platforms [18][19][20][21] fail to explore the incentive structures for data providers and requesters. In these works, fixed incentives are provided to data owners each time the data is shared with the participants. These works lack a dynamic pricing model that lets data owners collect incentives over time. To date, there is a lack of data-sharing infrastructure that can sustainably support the incentive structures needed to motivate data sharing on the blockchain.

Contributions
To overcome these limitations, in this paper, (i) we develop an incentive model to motivate user participation on a blockchain-based data-sharing platform, (ii) we create a functioning prototype from the resulting incentive model, and (iii) we conduct extensive experiments and analyze the solution in scenarios simulating real-life user interactions. Specifically, the scenarios showcase the accruing operational costs inherent to the blockchain implementation. The novelty of this work is a cost-benefit analysis of two main incentive systems: (i) sharing the costs of data providers and (ii) profit-making for data providers. Our simulation indicates that a cost compensation model for data providers quickly covers the cost of data sharing.
The remainder of this paper is structured as follows. Section 2 discusses the background on blockchain platforms. Section 3 presents the architecture of the incentive model. Section 4 presents the implementation of our solution. Section 5 discusses the evaluation of the proposed model, followed by a discussion in section 6. Section 7 discusses related work and highlights the limitations of the state-of-the-art. Finally, section 8 concludes the paper and presents future work.

Background
In this section, we explain the Ethereum blockchain, incentives, and our baseline data-sharing platform.

Blockchain as a decentralized network
Blockchain is a decentralized network of nodes that maintains a shared ledger of transactions. Blockchains consist of chained transaction blocks that are validated and added to the blockchain by nodes in the network. A new block is added to the ledger after being concatenated with the last confirmed block. This is achieved by adding the cryptographic hash of the previous block to the newly created block to generate an updated hash value. Once the block is added, the transactions contained in the block are permanent and immutable. Blockchains use validation nodes, also called miners, to update the ledger. Validation follows a pre-established consensus mechanism that specifies what constitutes a valid block. Consensus mechanisms differ; however, they all focus on rewarding validators for maintaining the state of the blockchain. Open blockchains do not limit or control the validation nodes. However, becoming a validator sometimes requires substantial initial investments, so validation rewards cannot serve as the main incentive mechanism for data-sharing platforms, since not all users will become validators. We focus on openly accessible decentralized networks, such as Ethereum [22], for their accessible ledger and general-purpose architecture. Using the Ethereum blockchain as the starting model for monitoring data transactions, individuals are able to inspect and control their data-sharing preferences. Importantly, blockchain technology circumvents the need for centralizing data with a third party and supports open data-sharing agreements that are validated by the network. Blockchain networks, by design, introduce transaction costs. This cost is a computational cost, which in Ethereum is measured in gas. Gas is a measure of the computational effort required to perform an operation; each unit of gas is assigned a price in Ether, expressed in Wei. Wei is the smallest denomination of Ether.
Two additional open and general purpose blockchain platforms have been recently launched, Cardano [23] and Polkadot [24]. We choose Ethereum for its well-established platform, yet the overall findings of this work apply to any of these platforms by accounting for the transaction fees and the computational costs of these other networks.
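The hash linking described in this subsection can be illustrated with a minimal Python sketch. It is not Ethereum's actual block format: the block fields, the helper names (block_hash, append_block, is_valid), and the use of SHA-256 as a stand-in hash function are all assumptions made for illustration.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block's full content (illustrative; SHA-256 is a stand-in)."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, transactions: list) -> dict:
    """Create a block that stores the hash of the last confirmed block."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    block = {
        "index": len(chain),
        "prev_hash": prev_hash,
        "transactions": transactions,
    }
    chain.append(block)
    return block


def is_valid(chain: list) -> bool:
    """Check that every block still points at the hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
append_block(chain, ["provider publishes dataset"])
append_block(chain, ["requester requests access"])
```

Changing any transaction in an earlier block changes that block's hash, so the stored prev_hash of its successor no longer matches and is_valid returns False. This is the sense in which confirmed transactions are immutable.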

Smart contract.
A smart contract is a digital protocol that facilitates, verifies, and executes one or multiple transactions [25]. Smart contracts, similarly to real-life physical contracts, translate contractual clauses between two parties. They achieve this with rules that are written into executable code. Smart contracts are executed independently by the network nodes and become immutable after deployment. Ethereum smart contracts provide a generic mechanism for building applications that require agreements between two or more parties. Using smart contracts, transactions become valid only when the contractual agreements are met, resulting in the storage of the transaction in the blockchain. We use smart contracts to define data-sharing and incentive rules between the data providers and data requesters.
Tokens.
Ethereum tokens are a special sub-type of cryptocurrency, usually defined as fungible, exchangeable assets. They are created from specialized smart contracts and are mostly used to create secondary economies on top of the Ethereum network. The main advantage of tokens is a platform-wide standard practice for method definition, which leads to fewer faulty contracts and easier interoperability. We use tokens for access control to data, thus providing exclusive data access based on the established agreements between data providers and data requesters. More specifically, tokens provide a way to link irrefutable blockchain transactions with data access control, so that data is not accessible to other users (i.e. data requesters) unless a prior agreement has been reached within a smart contract. ERC-20 [26] is a standard API for tokens in smart contracts that provides base functionality to transfer tokens or to approve third parties to transfer tokens. However, ERC-20 provides no mechanism to protect against faulty token transactions, making them irrecoverable in certain cases. ERC-721 is inspired by ERC-20 and implements a token standard where each token is unique and can have a different value (non-fungible). This makes it useful for representing physical property and other such assets. ERC-721 tracks ownership of each token individually. Additionally, it is possible to delete tokens, and the associated methods are robust against faulty inputs. However, it does not provide any type of data structure to associate tokens with individual properties. In this paper, we adapt the ERC-721 token standard to represent a unique access key to specific datasets, since it is the closest standard to our token implementation.

LUCE
LUCE [27] is a blockchain-based data-sharing platform that allows data providers to verify for which purpose, by whom, and in which time-frame their data is used. LUCE allows users to share and reuse data in compliance with license agreements. LUCE ensures compliance with the GDPR by giving the data provider personalized methods to control their data. Additionally, the data provider is able to update the data-sharing license and can update or erase the data. All changes propagate through the system. Data providers publish and update their datasets. When a dataset is published, the data provider supplies meta-information on the dataset, access requirements, and an access link. This information is saved in the respective smart contract. Thus, each dataset is connected to a separate smart contract. This model gives the provider fine-grained control over how each dataset should be accessed by requesters. If a data requester fulfills the requirements set by the data provider, they can perform time-bounded access requests. Requesters are able to renew their access time.
The smart contract provides GDPR compliance, which binds all requesters to the access conditions of each respective dataset. The supervisory authority (e.g. governmental institution) is responsible for enforcing the rights of the data subjects and general prevention of abuse of the platform. If there is a legal issue, i.e. a data requester's non-compliance with the license agreement of a specific dataset, the supervisory authority is able to audit the interactions of the parties involved.
In this work, we measure the incentives to share data, and LUCE provides a generic data-sharing model for this analysis. LUCE has two main benefits compared to other proposed solutions: its interaction protocols for sharing data are GDPR compliant, and it has been tested for scalability with real-life scenarios [27]. That evaluation shows that, compared to a blockchain baseline model, the LUCE data-sharing platform scales well with increasing traffic. The paper also highlights a need to fine-tune data sharing with incentive mechanisms for end-users. Using LUCE as a basic model for creating a decentralized sharing network, in this work we focus on extending the model with incentive mechanisms and their analysis. In our approach, we showcase the accruing operational cost for data sharing on the LUCE platform and the models for balancing them across the users.
There are several types of important incentive mechanisms to consider in decentralized networks:
• Control-Incentives. The main incentives relating to data control in a decentralized data-sharing solution are the ability to fully control data re-use and control over data ownership. A blockchain solution to data sharing can fully support these incentives as built-in functionality, where consent mechanisms are personalized based on ownership [14].
• Research-Incentives. Data requesters in particular are intrinsically motivated to use data-sharing platforms due to the value of data in research. This ties into the main incentive of the platform, which is promoting data sharing on a large scale. This incentive is powerful for all involved parties (data requesters and providers) due to the potential results deriving from research on the shared data (for example, medical research data). Data providers may be interested in the findings but might also simply regard data sharing as a goodwill act towards society.
• Monetary-Incentives. Monetary incentives in decentralized networks are very important to consider, especially for data providers. Decentralized networks distribute operational costs, which implies that a data provider will incur initial costs to share data and to keep them up-to-date. Monetary compensation is therefore an incentive for data providers, whereas data requesters may be very willing to pay for data access.
• Reputation-Incentives. An incentive that does not directly involve money is reputation [28,29]. Data providers may be willing to share data on the platform to receive mentions and recognition for data re-use. This is particularly relevant to researchers who become data providers to share their data collections for further re-use.
• Knowledge-Incentives. The most important type of incentive will be created by the knowledge shared by data requesters. This is in the form of analytical models, which, if returned to data providers, provide a personalized outcome for every data provider.
In this paper, we focus on monetary incentives, as these are the incentives that we can realistically simulate without extensive surveys and practical experimentation in a real-world test environment. Moreover, monetary compensation and cost allocation are the first elements to address in decentralized data-sharing networks, as the occurring costs may discourage data providers from participating in data sharing.

Incentive model architecture
The incentive model architecture comprises five smart contract components:
(i). Registry smart contract: provides authorization for data publishing and access requests.
(ii). Dataset smart contract: handles data publishing, updates, and cost control.
(iii). Smart contract ownership: defines the connected contracts as owned by the data provider that deploys the main contract, and is connected to an additional module that allows the owner to delete their smart contract.
(iv). Access smart contract: handles access and access renewal requests by data requesters, and is connected to the ERC-721 token generation contract.
(v). ERC-721 smart contract: adapted token standard that handles the token logic that is key to accessing the data.

Registry smart contract
The global registry smart contract interfaces with LUCE smart contracts to provide exclusive access to individuals. This registry is deployed and controlled by the institution responsible for verifying a registrant's information. When a user registers, their information is connected to a wallet in the blockchain, i.e. they are anonymous, yet unambiguously associated with their valid license information. Thus, a user's public key is synonymous with their identity, and, since it is impossible to deduce the identity of the owner from a public key, they act anonymously. The only information associated with these public keys is the requester's license or the provider's publishing permission, and the only parties privy to identifying information are the owner of the key and the authority that verified the owner's identity. When an individual performs their first transaction on the blockchain, e.g. publishing a dataset or requesting access to a dataset, their registration information is verified internally. This ensures that no unauthorized individual interacts with the relevant smart contracts, even if they possess the knowledge to circumvent the data sharing platform. However, this centralized control structure functions only as a gateway to the platform and has no influence on the actual data-sharing process, any possible monetary transactions, or even any purview of how the platform is used.

Dataset smart contract
The dataset smart contract establishes control for the data provider over their dataset. Each dataset must be published on a separate dataset smart contract. Due to GDPR requirements, each update that results in a change in the meta-information of the respective dataset requires all active data requesters to confirm their compliance. Specifically, they will be notified of the update, and until they have updated their own copy of the data and confirmed this via a special compliance function, the respective requester cannot perform access requests to the data. If the data provider changes the required license to access the data, all data requesters with access to the data are notified and the tokens corresponding to the wrong license type are deleted by the system. The affected data requesters must confirm their compliance with this change. Finally, the data provider establishes how the contract handles arising costs.
(i). Scenario 1. No compensation: each party pays only their own arising costs.
(ii). Scenario 2. Cost compensation: the data provider's costs are covered by the data requesters.
(iii). Scenario 3. Profit: the data provider seeks to profit from sharing their data.
The scenarios are meant to showcase how the system reacts to different incentives being implemented. Scenario 1 represents a case with no incentives apart from those naturally arising from using the system, i.e. data providers are most likely disincentivized from using the system since they incur costs by using it. Scenario 2 seeks to remedy this by implementing a structure that asks data requesters to pay a fraction of the provider's total running costs at the time of their request. This results in a gradual decline in running costs for the provider. As a consequence, early data requesters pay more than later data requesters, which raises a fairness consideration. Finally, scenario 3 shows how profits are generated and how soon the break-even point is reached.
To test these scenarios, the dataset smart contract allows data providers to adjust the settings for cost allocation. Data providers set a percentage profit margin that describes the total earnings they aim for.
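The declining payments in scenario 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name requester_payments, the fixed fraction of 20%, and the starting cost of 100 units are hypothetical, and the sketch assumes that each payment is subtracted from the provider's running costs and that no new provider costs arise in between.

```python
def requester_payments(running_cost: float, fraction: float, n_requests: int) -> list:
    """Return the amounts paid by successive requesters.

    Each requester pays `fraction` of the provider's running cost at the
    time of the request; the payment then reduces the running cost.
    """
    payments = []
    for _ in range(n_requests):
        payment = running_cost * fraction
        running_cost -= payment  # assumption: payments offset running costs
        payments.append(payment)
    return payments


# Hypothetical example: running cost of 100 units, 20% paid per request.
payments = requester_payments(running_cost=100.0, fraction=0.2, n_requests=5)
```

Under these assumptions the payments shrink geometrically, which makes the fairness question concrete: the first requester pays the largest amount, and the provider's costs are never fully recovered by a finite number of requests unless new requests keep arriving.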

Smart contract ownership
This module establishes a method to control which individuals (i.e. public addresses) may call certain core functions of the underlying contracts, such as issuing an update to the data. When a data provider deploys their copy of the template smart contract to publish a dataset, their address is immediately recorded as the owner of that smart contract and of all smart contracts that inherit it. The most important function that needs the authorization of the owner is the destruction of the contract and all superordinate and subordinate contracts. This function is implemented in a smart contract sub-module, which allows the owner to send all funds from the internal balance of the smart contracts to their public address while setting all internal variables to zero. Therefore, any subsequent call to this contract will be voided. With this, we implement the data providers' right to delete their data (GDPR, Article 17 [7]). However, it is important to ensure that requesters are adequately informed of this change; otherwise, they may mistakenly transfer funds to the destroyed contract, which results in those funds being lost forever. Therefore, we automatically delist a deleted dataset's contract address from the data catalog.

ERC-721 smart contract
The purpose of generating tokens as access keys to datasets is that they represent a fixed, standardized data structure that is easily interfaced. For this, the token must have several properties: it must be unique, provide adequate control methods and internal data structures, and be easily traceable. The ERC-721 smart contract module maintains a list of all tokens generated. In practice, a token is simply an entry in this list, represented by a unique ID that unambiguously identifies it. This ID is associated with an owner, i.e. the individual (public address) that minted it. Only the owner can transfer the token to another individual. The transfer of a token results in all associated values being accessible to and controlled by the new owner. Since requesters should not have the ability to transfer their token to other requesters, we created a new structure that associates the token ID with its user, i.e. the requester. As a result, the user of a token has only limited control over it. They are able to use it for three purposes: accessing the data, renewing access time to the data, and deleting their access to the data. Moreover, we created an internal data storage structure that saves meta-information on the requester and the token (e.g. license, access time, etc.), which only the data provider, the respective data requester, and the supervisory authority can access. By limiting access to this information, we protect the privacy of the data requester.
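The separation between a token's owner and its user can be sketched as follows. This is a plain Python illustration, not the Solidity contract: the class and method names (Token, TokenRegistry, mint, transfer, read_meta) are hypothetical, and the sketch assumes that the data provider is the owner of the tokens minted for their dataset.

```python
from dataclasses import dataclass


@dataclass
class Token:
    owner: str         # address that minted the token; only the owner may transfer it
    user: str          # requester allowed to use the token, without transfer rights
    license_type: int  # meta-information stored with the token
    access_until: int  # expiry of the access time (e.g. a Unix timestamp)


class TokenRegistry:
    """Illustrative list of tokens keyed by a unique ID."""

    def __init__(self, supervisory_authority: str):
        self.supervisory_authority = supervisory_authority
        self.tokens = {}
        self._next_id = 1

    def mint(self, owner: str, user: str, license_type: int, access_until: int) -> int:
        token_id = self._next_id
        self._next_id += 1
        self.tokens[token_id] = Token(owner, user, license_type, access_until)
        return token_id

    def transfer(self, caller: str, token_id: int, new_owner: str) -> None:
        token = self.tokens[token_id]
        if caller != token.owner:
            raise PermissionError("only the owner can transfer a token")
        token.owner = new_owner

    def read_meta(self, caller: str, token_id: int) -> Token:
        token = self.tokens[token_id]
        allowed = (token.owner, token.user, self.supervisory_authority)
        if caller not in allowed:
            raise PermissionError("caller may not read token meta-information")
        return token
```

Keeping the user in a separate field means a requester can hold and use an access key without ever gaining the transfer right that a standard token owner has.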

Access smart contract
This contract holds the methods for data access and access renewal requests, implements cost coverage and GDPR compliance systems, and allows data requesters to relinquish their access if it is no longer needed. Whenever a data requester performs an access request, this contract establishes a connection with the LUCE registry to confirm their license. In addition, it implements the cost coverage system, which applies the settings controlled by the data provider. If all access requirements are met, the contract generates a unique token via the ERC-721 contract [30]. This unique token serves as an access key for the data requester to the data. Fig 2 shows an overview of the methods data requesters have at their disposal.
When the data requester successfully gains access to the data, by default they are granted two weeks of access time, after which they must either actively delete their copy of the data or renew their access time. We implement methods for both options. Access time renewal requires that the data requester actively confirms their compliance with GDPR requirements following an update by the data provider. The compliance function signifies that the requester has actively confirmed their compliance with all past updates. This serves as a marker for the supervisory authority should there ever be a complaint against the respective data requester that requires investigation. If this requirement is fulfilled, the data requester is given more access time. Finally, if the data requester wishes to relinquish their access to the dataset, they can do so by disassociating their public address (i.e. anonymized identity) from the token. This causes the respective data requester to lose access to the data unless they decide to submit a new access request.

Implementation
In this section, we provide the implementation details of the user incentive model proposed in this paper.

Experimental setup
The incentive model is implemented on top of the Ethereum blockchain. We implement the smart contracts of the incentive model in Solidity [31], a language for smart contracts provided by Ethereum. The platform uses Web3 JavaScript libraries [32] to interact with the Ethereum blockchain and Django [33] to implement the user interface. The data providers interact via the Django web framework to share the data and specify the associated incentives. The platform stores the link between the smart contract and the corresponding datastore location. Through the LUCE platform, the model interacts with Ganache [34], a test network that creates a virtual Ethereum blockchain and generates pre-configured accounts that are used for development and testing. We use Ganache to generate 1000 accounts, each pre-funded with 100 Ether, which enables the deployment of the contracts. Ganache provides the balance in Ether and reports the gas used for running the transactions. Gas consumption varies based on the complexity of the functions defined in the smart contract. We use a gas price of 72 Gwei and an Ether price of 1 ETH = $1716.52, as observed on 22/06/2021 [35]. Our incentive model implementation is available as open-source (https://github.com/vjaiman/LUCE_Incentives).
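To make the cost parameters concrete, the following Python sketch converts a gas amount into Wei and US dollars using the gas price and Ether price stated above. The function names are hypothetical, and the 21,000 gas figure is simply the base cost of a plain Ether transfer on Ethereum, used here as an example rather than a measurement from our contracts.

```python
GAS_PRICE_GWEI = 72      # gas price used in the experiments (22/06/2021)
ETH_USD = 1716.52        # Ether price used in the experiments (22/06/2021)
WEI_PER_GWEI = 10**9
WEI_PER_ETH = 10**18


def gas_to_wei(gas_used: int, gas_price_gwei: int = GAS_PRICE_GWEI) -> int:
    """Transaction fee in Wei for a given amount of gas."""
    return gas_used * gas_price_gwei * WEI_PER_GWEI


def wei_to_usd(wei: int, eth_usd: float = ETH_USD) -> float:
    """Convert an amount in Wei to US dollars at the given Ether price."""
    return wei / WEI_PER_ETH * eth_usd


# Example: a plain Ether transfer consumes 21,000 gas.
fee_wei = gas_to_wei(21_000)
fee_usd = wei_to_usd(fee_wei)
```

Fees are kept in integer Wei until the final conversion, which mirrors how amounts are handled on-chain and avoids rounding errors in intermediate steps.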

Data provider cost allocation control
In our incentive model, the running costs after a transaction are equal to the running costs before the transaction plus the cost of the transaction multiplied by the profit margin.
The profit margin describes the total earnings aimed for, covering both expenses and returns, and is set via the setProfitMargin function. If a data provider wishes to receive no profit, the margin is set to 100%, i.e. 100% of the pure costs of the data provider. If a data provider wishes to generate profits from sharing their data, they must declare their desired earnings as a percentage of their costs greater than 100%. In addition, by calling the setMultis function, the data provider controls the percentage of the running costs that each data requester must pay upon an access or access time renewal request. The providerGasCost modifier applies regardless of the running scenario and represents a convenient way for the data provider to keep track of their running costs in all scenarios. By using this modifier to measure costs arising from publishing data, we essentially ask the data provider to make an initial investment. This is beneficial for several reasons. First, it discourages poor-quality data from being shared. Second, it reduces the complexity of the system by a large margin, since the alternative is employing meta transactions (a special type of transaction that is signed by one individual and then published so that an arbitrary different individual executes it in the name of the signer). With meta transactions, the data provider signs a prepared transaction; afterwards, the data requester submits the data provider's signed transaction to the blockchain and thus pays the associated gas cost directly.
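The cost accounting described above can be mirrored in a short Python sketch. It is not the Solidity implementation: the class CostTracker and its methods are hypothetical stand-ins for the roles played by setProfitMargin, setMultis, and the providerGasCost modifier, and the sketch assumes that requester payments are subtracted from the running costs. Integer percentages are used because Solidity has no floating-point numbers.

```python
class CostTracker:
    """Hypothetical Python mirror of the provider-side cost accounting."""

    def __init__(self, profit_margin_pct: int = 100, requester_share_pct: int = 10):
        self.profit_margin_pct = profit_margin_pct      # 100 means no profit
        self.requester_share_pct = requester_share_pct  # share of running cost per request
        self.running_cost = 0                           # tracked in Wei

    def record_provider_tx(self, tx_cost_wei: int) -> None:
        # running cost after = running cost before + transaction cost * profit margin
        self.running_cost += tx_cost_wei * self.profit_margin_pct // 100

    def amount_due(self) -> int:
        """Amount a requester must send with an access or renewal request."""
        return self.running_cost * self.requester_share_pct // 100

    def record_payment(self) -> int:
        # Assumption: a requester's payment reduces the provider's running cost.
        due = self.amount_due()
        self.running_cost -= due
        return due


# Hypothetical example: 150% margin, each requester pays 10% of the running cost.
tracker = CostTracker(profit_margin_pct=150, requester_share_pct=10)
tracker.record_provider_tx(1_000)
first_payment = tracker.record_payment()
```

With a margin of 100% the tracked amount equals the provider's pure costs, so requesters can at most reimburse the provider. With a margin above 100% the tracked amount exceeds the true costs, and the difference is what eventually becomes profit.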

Data requester methods
In this section, we explain the core functionalities of the smart contracts used in our incentive model.
• Request access. Access rights are distributed via tokens, which are associated with the data requester once their legitimate claim has been verified. To obtain a token, a data requester has to satisfy a range of requirements: i) the dataset must be published, ii) the requester must not yet own an access token to this dataset, iii) the requester must be registered and possess the same license as is required for accessing the data, and iv) the smart contract checks which scenario it is running; if it is scenario 2 or 3, the requester must submit an appropriate amount with their access request. Once the data requester receives an access token, they call the getLink function to download the dataset.
• Renew access time. The access time associated with any access token is fixed to a reasonable amount of time (e.g. 2 weeks). If a data requester needs data access for longer, they can renew the access time. For this, the data requester must first hold an access token to that specific dataset. Second, they must have confirmed compliance with any previous updates. The confirmCompliance function allows data requesters to notify the system of their GDPR compliance following an update, which allows them to renew their access time to the data.
• Relinquish access. A data requester with a token has a limited range of actions they can take, the most relevant of which are accessing the data, renewing their access time to the data, and deleting their token should that ever be required. To delete their token, a data requester must call the burn function; the smart contract also calls it upon a change in the license requirement. In either case, the function first notes the remaining access time (0 if the access time is expired). Then, the internal _burn function of the ERC-721 token standard is called, which associates the token with the null address, i.e. it can no longer be used. Regardless of how the function is called, the data requester is notified of the event. If the token deletion was issued by the data requester, their compliance is set to true, since token deletion should always involve the deletion of the requester's copy of the dataset as well. If the token deletion was issued by a change in the license type, compliance is set to false.
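The three requester methods above can be summarized in a small Python sketch of the checks involved. It is a simplified model rather than the contract code: the class DatasetContract, its fields, and the snake_case method names are hypothetical, and it assumes that a fresh token starts out compliant and that a renewal grants a new two-week period counted from the time of renewal.

```python
from dataclasses import dataclass, field

ACCESS_PERIOD = 14 * 24 * 60 * 60  # two weeks in seconds, the default access time


@dataclass
class DatasetContract:
    required_license: int
    scenario: int = 1        # 1: no compensation, 2: cost compensation, 3: profit
    amount_due: int = 0      # payment expected in scenarios 2 and 3 (Wei)
    published: bool = True
    expiry: dict = field(default_factory=dict)     # requester -> access expiry time
    compliant: dict = field(default_factory=dict)  # requester -> compliance flag

    def _check_payment(self, payment: int) -> None:
        if self.scenario in (2, 3) and payment < self.amount_due:
            raise ValueError("insufficient payment")

    def request_access(self, requester, license_type, registered, now, payment=0):
        if not self.published:
            raise ValueError("dataset not published")
        if requester in self.expiry:
            raise ValueError("requester already holds a token")
        if not registered or license_type != self.required_license:
            raise PermissionError("registration or license check failed")
        self._check_payment(payment)
        self.expiry[requester] = now + ACCESS_PERIOD
        self.compliant[requester] = True  # assumption: no updates pending yet

    def issue_update(self) -> None:
        # After a provider update, every active requester must re-confirm.
        for requester in self.expiry:
            self.compliant[requester] = False

    def confirm_compliance(self, requester) -> None:
        self.compliant[requester] = True

    def renew_access(self, requester, now, payment=0) -> None:
        if requester not in self.expiry:
            raise PermissionError("no access token for this dataset")
        if not self.compliant[requester]:
            raise PermissionError("compliance not confirmed")
        self._check_payment(payment)
        self.expiry[requester] = now + ACCESS_PERIOD  # assumption: fresh period

    def burn(self, requester, now, license_changed=False) -> int:
        # Note the remaining access time (0 if expired), then remove the token.
        remaining = max(self.expiry.pop(requester) - now, 0)
        self.compliant[requester] = not license_changed
        return remaining
```

In this model the compliance flag is the only state that survives a burn, which matches its role as a marker for the supervisory authority.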

Evaluation
In this section, we evaluate the effectiveness of monetary incentives. Our evaluation aims at answering the following questions:
1. How do costs arise over time from using the system?
2. How long does it take to cover the costs in scenarios 2 and 3?
3. How to find a balance between cost coverage for the data provider and fair payment amounts for all data requesters?

Initialization
Our simulation runs in a loop, where each iteration signifies the passing of one period. In each period, multiple actions are made. An action in this context refers to one of four possible decisions being made: publishing data, updating data, requesting access, or renewing access time.
Each potential data provider and data requester is associated with a certain probability of taking action. We assume that the probability of a data requester taking action follows a normal distribution with independent, identically distributed samples, since this is the most commonly occurring distribution in nature. We center our distribution around 0 (μ = 0) and assume a standard deviation of 0.1 (σ = 0.1). To associate each account with a normally distributed probability, we first generate 1000 random values from a normal distribution with the aforementioned parameters. Since the resulting values do not lie between 0 and 1, we normalize them. This results in a vector of random, normally distributed probabilities, which we append to the user accounts list. Thus, a data requester will, on average, have a 50% probability of performing an access request in a period. However, since we do not expect data requesters to require access to a specific dataset for an indefinite amount of time, we adjust their probability of taking action downwards by a factor of 0.75 each time they renew their access time to the data. This results in data requesters renewing their access time only very rarely after the fifteenth time (corresponding to 0.5 × 0.75^15 ≈ 0.668%). Thus, we achieve a natural balance of data requesters starting, continuing, and stopping to renew their respective access time, and avoid exponential growth of actions being taken per period, which would be highly unrealistic. We do not simulate data requesters burning their tokens at that point, since it is irrelevant for the data provider's costs.
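The generation of requester probabilities can be sketched as follows. The text does not specify how the samples are normalized, so this sketch assumes min-max rescaling to the interval [0, 1]; the function names and the fixed seed are likewise illustrative choices, and only the Python standard library is used.

```python
import random


def action_probabilities(n: int, mu: float = 0.0, sigma: float = 0.1, seed=None) -> list:
    """Draw n normal samples and rescale them to [0, 1] (assumed min-max scaling)."""
    rng = random.Random(seed)
    samples = [rng.gauss(mu, sigma) for _ in range(n)]
    lo, hi = min(samples), max(samples)
    return [(x - lo) / (hi - lo) for x in samples]


def after_renewals(probability: float, renewals: int, decay: float = 0.75) -> float:
    """Probability of acting after a number of access time renewals."""
    return probability * decay ** renewals


probs = action_probabilities(1000, seed=42)        # one probability per account
p_after_15 = after_renewals(0.5, 15)               # average requester, 15 renewals
```

Because the normal distribution is symmetric, the rescaled probabilities cluster around the middle of the interval, and the decay factor then pushes a requester's probability towards zero with each renewal.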
For data providers, we assume that the probability of choosing to publish is far lower than for an average data requester making an access request. Therefore, each data provider is given a uniformly distributed probability to publish that lies between 1% and a maximum probability specified by us (default is 5%). This overwrites the normally distributed probability assigned to the Ganache accounts designated as data providers. This reflects our assumption that data providers are generally less numerous than data requesters and thus take action less often.
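The initialization described above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulation code itself: the function names are hypothetical, and min-max rescaling is only one plausible reading of "normalize", since the text does not name the method.

```python
import random

def normalized_normal_probabilities(n, mu=0.0, sigma=0.1, seed=None):
    """Draw n normal samples and rescale them into [0, 1].

    Assumption: min-max rescaling stands in for the unspecified normalization.
    """
    rng = random.Random(seed)
    samples = [rng.gauss(mu, sigma) for _ in range(n)]
    lo, hi = min(samples), max(samples)
    return [(s - lo) / (hi - lo) for s in samples]

def provider_publish_probability(rng, max_prob=0.05, min_prob=0.01):
    """Uniform publish probability between 1% and the configured maximum."""
    return rng.uniform(min_prob, max_prob)

def decayed_probability(base, renewals, factor=0.75):
    """Probability of acting after a given number of access renewals."""
    return base * factor ** renewals

probs = normalized_normal_probabilities(1000, seed=42)
print(sum(probs) / len(probs))       # roughly 0.5 for a near-symmetric sample
print(decayed_probability(0.5, 15))  # about 0.00668, i.e. 0.668%
```

With min-max rescaling the sample mean lands near 0.5 only approximately, because it depends on the most extreme draws in the sample.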

Assumptions.
The simulations are performed under the following assumptions about data providers and data requesters:
(i). The probability of a data provider deciding to publish their dataset is lower than the probability of updating it after publishing.
(ii). The probability of both publishing and updating a dataset is constant, independent of any resulting costs, and independent of the number of data requesters who have access to the dataset.
(iii). The probability of publishing is independent of the type of dataset.
(iv). The probability of data requesters taking action decreases over time. Therefore, no data requester will continue to renew access to a single dataset indefinitely.
(v). Data requesters have an unlimited amount of money potentially available to request access or renew access time to datasets.

Simulation
The first action in each simulation instance is the first data provider publishing their dataset.
In each period we check for each of the four possible actions:
• Publish: exactly one data provider has the chance to publish (determined by their probability of taking action). Until they do publish, no other data provider will be able to publish. This represents the passage of time (periods) between different providers publishing their data.
• Update: each data provider with a published dataset has the chance to issue an update. We assume that a data provider, once they have published their dataset, is legally required to update it regularly, and we therefore increase the chance to update by a certain factor.
• Request: one data requester has the chance to request access to a randomly determined dataset among those available. If the data requester does not request access, they will have the same chance to do so in the next iteration of the loop. Likewise, every data requester in the system gets the opportunity to request access. This simulates the potential time gap between different requesters performing access requests.
• Renew: each data requester with an access token will have the chance to renew their access time to the data. In our simulation, we assume that requesters will only renew access time if it has expired since this is economical behavior. A data requester may not know precisely for how long they need access, thus we add access time only when needed, especially since potential costs in scenarios 2 and 3 are likely to be lower with each passing period.
We simulate the passage of time by assigning probabilities to users that might or might not take an action. On the other hand, we attribute access times in real seconds to the tokens generated upon a successful request or access renewal. Since the simulation would be flawed if these two systems did not operate synchronously, we implement a condition that disallows access time renewal until 2 periods after the requester's last action. This reflects the idea that a period is roughly equivalent to a week, so that each data requester renews their access to the data for two weeks at a time.
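One period of the loop described above can be sketched as follows. This is a simplified model under stated assumptions, not the LUCE simulation code: the class and function names, the `update_factor` value, and the queue-style handling of requesters (one waiting requester is offered a request per period) are all hypothetical readings of the text. Costs and smart contract calls are omitted.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Provider:
    publish_prob: float
    published: bool = False

@dataclass
class Requester:
    prob: float                    # probability of acting in a period
    dataset: Optional[int] = None  # index of the dataset they hold a token for
    last_action: int = 0           # period of their most recent action

def run_period(t, providers, requesters, rng,
               update_factor=2.0, cooldown=2, decay=0.75):
    """Run one period and return the actions taken (hypothetical model)."""
    actions = []
    available = [i for i, p in enumerate(providers) if p.published]

    # Publish: only the next unpublished provider in line gets a chance.
    waiting = next((p for p in providers if not p.published), None)
    if waiting is not None and rng.random() < waiting.publish_prob:
        waiting.published = True
        actions.append("publish")

    # Update: providers published before this period, with a raised chance.
    for i in available:
        if rng.random() < min(1.0, providers[i].publish_prob * update_factor):
            actions.append("update")

    # Renew: token holders, but only `cooldown` periods after their last action.
    for r in requesters:
        if r.dataset is not None and t - r.last_action >= cooldown:
            if rng.random() < r.prob:
                r.last_action = t
                r.prob *= decay  # each renewal lowers the chance of the next one
                actions.append("renew")

    # Request: one waiting requester may request a random available dataset.
    candidate = next((r for r in requesters if r.dataset is None), None)
    if candidate is not None and available and rng.random() < candidate.prob:
        candidate.dataset = rng.choice(available)
        candidate.last_action = t
        actions.append("request")
    return actions

rng = random.Random(7)
providers = [Provider(publish_prob=rng.uniform(0.01, 0.05)) for _ in range(3)]
providers[0].published = True  # the run starts with the first provider publishing
requesters = [Requester(prob=0.5) for _ in range(20)]
log = []
for t in range(1, 119):
    log.extend(run_period(t, providers, requesters, rng))
print({a: log.count(a) for a in ("publish", "update", "request", "renew")})
```

The two-period cooldown is what ties the probabilistic notion of a period to the real-time access duration stored in the token.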

Determining optimal parameters
As seen in Table 1, the most pivotal variables (apart from the scenario itself) are the action-Ticker and the cost fraction data requesters must pay when making access requests or renewing their access time. We simulate scenario 2 to determine the optimal values for these variables since this scenario is the most dependent on actions. We observe that a high percentage cost distribution (i.e. the fraction a data requester must pay in return for access) leads to an overly rapid decline in the running contract cost and immediate coverage of newly arising costs whenever the data provider updates. This is inherently unfair to the data requesters, since some will pay high amounts while others pay almost nothing. At the other extreme, when data requesters pay only a small fraction of the running contract costs, we observe a balancing of revenue and expenses above zero, which is not the goal of scenario 2. Thus, we conclude that the fraction must lie between the extremes to be effective, i.e. 5% cost coverage and 500 actions. The profit margin for scenario 3 is set to 200%, meaning the data provider's total earnings in this scenario are exactly double their costs (making for 100% pure profit after covering costs).
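The trade-off between the two extremes can be illustrated with a small calculation. The sketch below assumes that each request pays a fixed fraction of whatever running cost is still uncovered and that no new costs arise in between. The $1000 starting cost, the 60 requests, and the function name are illustrative values, not figures from our simulation.

```python
def simulate_payments(running_cost, fraction, n_requests):
    """Each request pays `fraction` of the remaining running cost (assumed rule)."""
    payments = []
    for _ in range(n_requests):
        pay = fraction * running_cost
        payments.append(pay)
        running_cost -= pay
    return payments, running_cost

# Compare a high, a moderate, and a very low fraction (hypothetical values).
for fraction in (0.50, 0.05, 0.005):
    payments, left = simulate_payments(1000.0, fraction, 60)
    print(f"{fraction:.1%}: first pays {payments[0]:.2f}, "
          f"last pays {payments[-1]:.2f}, uncovered {left:.2f}")
```

A high fraction concentrates nearly all of the cost on the first few requesters, whereas a very low fraction leaves a large share of the cost uncovered after the same number of requests.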

Cost analysis
Transactions on the Ethereum network have a gas cost that is directly proportional to the internal operations of the respective function call in the smart contract. Specifically, storing data on the blockchain is relatively expensive; therefore, the cost of writing to the blockchain scales with the size of the content. Thus, the deployment cost of a new smart contract is generally quite high compared to transactions resulting from calling the functions of that smart contract. Table 2 describes the cost parameters used in the incentive model. Table 3 shows the base costs of the core functions of LUCE, whereas Table 4 shows the cost of the core functions of the LUCE registry smart contract. These are the pure transaction costs resulting from calling the respective function, which equates to scenario 1. In scenarios 2 and 3, the request and renew functions require additional funds to be transmitted with each function call.

Table 2. Cost parameters used in the incentive model.
Parameter: Representation
totalCost: a running total of all arising costs, regardless of how or where they arise.
transactionCost: the total cost of the transaction resulting from the user's action.
currentExpectedCost: the expected cost for a data requester before they initiate a transaction.
nextExpectedCost: the expected cost for a data requester after they initiate a transaction.
providerEarnings: a running total of the amount transmitted to the contract as payment.
providerCost: a running total of the costs arising from the provider taking action (i.e. publishing or updating their data).
https://doi.org/10.1371/journal.pone.0266624.t002

As mentioned before, the costs to update a dataset scale with its active users. Therefore the cost is relatively low when there is no data requester ($5.40), and far higher when there are e.g. 60 data requesters ($64.30), which amounts to roughly $1.07 per requester for an update. Fig 3 shows that these comparatively higher costs are still easily covered by the system. It shows the profits generated in each scenario. We see that after approximately 40 periods in scenario 2, costs are completely covered, whereas in scenario 3 the break-even point is reached faster, and positive returns are measured as early as period 16.
The cost of updating the meta-information of the data in the smart contract scales with the number of requesters since each requester must be notified of that update to give them a chance to comply. Fig 4 displays the relationship between running contract costs (grey line; the spikes are updates) and individual transactions in more detail. We observe that the running costs of a smart contract are influenced by individual transactions made by the data provider and data requesters. Here, we observe rising update costs (the blue X marks) and sinking access costs over time (the orange squares and plus signs). Each data requester in this scenario pays 5% of the running costs at the time of their request. With this setting, data providers in scenario 2 can reasonably expect that their costs will always be covered, provided that data requesters continue to use their dataset. If the dataset loses its value, cost coverage may take longer or, in extreme cases, costs may not be covered. In our simulation, the only difference between scenario 2 and scenario 3 is the profit margin. Profits in scenario 3 are effectively a linear multiple of costs in scenario 2 and follow the same arguments. However, since scenario 3 is explicitly profitable, it reaches the break-even point faster, in proportion to how high the profit margin is set. We also observe the change in additional costs for data requesters. After initial deployment (periods 1-20), costs for requesters are higher than in later periods (after period 20). In Fig 4, there are 59 data requesters in total, simulated over 118 periods. Specifically, there are 27 updates to the data (frequency 0.22/period), 59 access requests (frequency 0.48/period), and 418 access time renewals (frequency 3.54/period). This amounts to a total of 505 actions and reflects our assumption that there are far more data requesters than providers.
The initial cost for a data requester depends on which scenario we are simulating. As shown in Table 3, the base cost of requesting access, which applies in scenario 1, is $58.70. In the other two scenarios, a variable additional price is added to cover the data provider's cost or generate the data provider's profit, respectively.
Figs 5 and 6 show requester costs specific to each scenario. We observe the average base transaction cost for requester action types and the additional cost stacked on top (which the requester bears instead of the provider in the case of scenarios 2 or 3, respectively). Compared to requesters' individual costs, the data provider has much higher costs, as shown in Fig 6. Over 118 periods, data providers must invest between $1445 and $3877. However, as demonstrated by our simulations, even the relatively high initial costs of deployment are expected to be quickly recovered by the data provider in the case of scenarios 2 and 3. This reflects the assumption that there are far more data requesters than providers. Otherwise, data providers are likely to be forced to set higher cost allocation fractions to cover their costs. For a more detailed overview of what range of costs each user of the platform can expect, we plot the simulated cost distributions for each action type on a logarithmic scale in Fig 7. Fig 7 shows that there are a few outliers in the cost distribution among data requesters, both when initially requesting access and when renewing that access. This is attributed to the fact that the first five to ten early requesters cover the majority of initial deployment costs, which are generally much higher than update costs. This unequal distribution of costs can be smoothed out by choosing a smaller fraction to denote the percentage of the running costs requesters must cover. If this fraction is chosen too small, however, the data provider's costs will not be covered, which defeats the purpose of scenario 2.
With the simulations of the three distinct scenarios, we show that depending on the parameters set in the smart contract, (a) data providers face considerable up-front costs to cover the deployment of pertinent smart contracts, and (b) the initial investment and all running costs incurred through updates are rapidly recovered by data providers. Importantly, this suggests that both cost and monetary incentives are effective means to motivate data providers to participate in the LUCE platform. In scenario 2, the ability for data providers to quickly recover up-front investments minimizes the disincentive that up-front costs would otherwise present. Consequently, the main positive incentives in scenario 2, pertinent and reputation, will likely not be significantly diminished by cost. Scenario 3 extends this by additionally introducing a monetary incentive. Here, costs incurred by data providers are covered even more rapidly than in scenario 2, and providers additionally benefit from profits, determined by the profit margin they set. Because of how we calculate payments by data requesters, data providers cannot profit indefinitely; their profit depends on the frequency with which they update their data. The profit calculation is directly derived from occurring costs. Therefore, if data requesters sufficiently outnumber data providers, there will come a point where the data provider has fully achieved their desired profit because it is a linear combination of their costs. From that point, only new costs incurred by the data provider, e.g. an update to their data, will result in net profit. This effectively limits how much a data provider can ever profit from high demand. Since the same calculation is used for scenario 2, where no profit is generated, high demand will similarly result in costs being covered completely, which means requesters have no additional costs from requesting access to the data.
In such cases of extremely high demand, it is a valid fairness consideration for the data provider to lower the percentage of the running costs each data requester must pay. Conversely, if there is extremely low demand, the data provider may wish to increase this percentage. As such, we provide the data provider with the tools they need to control how their costs are covered or profits are generated.
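The saturation of earnings described above can be made explicit with a short derivation. It is a sketch under simplifying assumptions that are ours rather than the smart contract's: every payment is a fixed fraction of the outstanding earnings target, no new provider costs arise between payments, and the symbols f, m, C, E_n, and p_n are introduced here for illustration only.

```latex
% Assumed notation (not from the contract code):
%   f   = fraction of the outstanding amount paid per request or renewal
%   m   = earnings target as a multiple of cost
%         (m = 1 in scenario 2, m = 2 for a 200% profit margin in scenario 3)
%   C   = provider cost accrued so far
%   E_n = provider earnings after n payments, p_n = the n-th payment
\begin{align}
p_{n+1} &= f\,(mC - E_n), \qquad E_0 = 0,\\
mC - E_{n+1} &= (1-f)\,(mC - E_n) \;\Longrightarrow\; mC - E_n = (1-f)^n\, mC,\\
E_n &= mC\left[1 - (1-f)^n\right] \;\xrightarrow{\;n\to\infty\;}\; mC .
\end{align}
```

Under this reading, the provider's profit E_n − C approaches (m − 1)C and grows further only when an update adds new cost ΔC, which raises the earnings target by mΔC. For m = 1 the same expression shows costs being covered completely, after which additional payments tend to zero.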

Incentives
Our results show that in scenario 2 the costs of the data provider are quickly recovered. An important question that remains is how long this will take in the real world. This time should not be unreasonably long. If we assume that one period equates to one week, then complete cost coverage will take approximately seven months. Conversely, if we assume that a period is a day, it will take less than one month to cover all costs. However, since this is based on stringent assumptions about the users of the system, it is impossible to deduce the number that reflects reality. The only way to reasonably predict this would be a study that surveys how data subjects, providers, and requesters would act if they had access to the system. Nevertheless, given the low relative costs of data provision for the presumed participants, even a conservative estimate of cost recovery over several months does not present a significant disincentive for data providers.

Costs
Additionally, we do not consider costs resulting from data preprocessing prior to any data analysis. Large data providers (e.g. a medical center) need to employ people to facilitate the compilation of relevant data to be shared on the LUCE platform. These costs would be injected into the smart contract logic, and data requesters would ultimately defray them. However, if our assumption holds that data requesters far outnumber providers, this additional cost will likely not outstrip the costs by an insurmountable margin.

Related Work
Several works have focused on data-sharing incentives for decentralized networks. Shrestha et al. [19] introduce a basic functioning framework for data sharing via blockchain authentication. Apart from the system's inherent data-sharing incentives, the authors focus on a monetary compensation incentive for data providers. The authors, however, do not consider the specifics of the incentive mechanisms, for example, whether profit is generated for providers or whether the system aims to break even.
In this paper, we contribute a detailed perspective of costs resulting from data-sharing platforms utilizing a comprehensive, extended, and easily reproducible prototype with sophisticated smart contract logic. We show how users are incentivized to participate in the platform, and what ramifications different cost allocations have for the system. The Ocean protocol [36] functions as a marketplace listing all available datasets. Data providers hold the data themselves and only release it when there is a legitimate request, verifiable through a respective entry in the underlying blockchain smart contract. The economy of Ocean is based on their in-house crypto-token called OCN. The OCN token discourages sharing poor-quality data by implementing a staking mechanism that ties the provided data to personal assets; high-quality data result in reaching the break-even point quickly [37]. The drawback of their in-house token is that it adds a layer of complication to the system that neglects to ensure asset value retention. This is because Ocean actively avoided implementing price stability due to performance concerns. Another drawback is the lack of autonomous tools for the data provider and data subject to directly and effectively facilitate GDPR compliance [7]. Our incentive model accounts for the GDPR compliance of data providers and calculates incentives considering that shared data may need to be updated or deleted based on the requests of data subjects.
Guidi et al. [38] describe pinning services in the Interplanetary File System (IPFS) [39], where nodes can pin their data as important so that the data will not be removed from the network. Pinning services improve the speed of retrieving data from IPFS. In our work, we avoid using data-sharing models built on IPFS since, due to its decentralized nature, enforcing data erasure as required by GDPR across the entire IPFS network is currently infeasible [40]. Xuan et al. [41] offer a mathematical analysis of participation strategies in blockchain-based data-sharing applications based on game theory. The authors derive four conditions for which they model user participation in the system and create an incentive method that results in a stable user base, i.e. no over- or undersaturation of users willing to share data. This provides a basis for a more sophisticated simulation that derives participation probabilities from gain functions and pricing strategies. However, the authors do not detail the data requesters' payment structures for accessing the data. In contrast, our incentive-based approach gives a balanced view of the system with different incentive strategies and is GDPR compliant.
The authors of [20] create smart contracts that give the data providers full transparency over who accesses their data and for which purpose they use the data. They specify a range of purposes of data sharing and provide an incentive to the data providers for sharing their data as specified in the contracts. The incentive is given in the form of ether transferred to their addresses. Nizamuddin et al. [42] propose an incentive framework between publisher and author to govern the sales of e-books. It handles cases related to the delivery of e-books, failure of downloads, and refunds. However, in these works, fixed incentives are provided each time the data is shared with the participants. In our work, we detail a more dynamic pricing of incentives that are collected by data owners over time. Ersoy et al. [21] determine a fee-sharing function that distributes the transaction fee among the propagating nodes and the round leader. For an accepted transaction that is going to be included in the block with fee F and propagation path P, the function F determines the shares of each node involved. The authors present a theoretical framework of the model, and the detailed effect of the incentive model, which depends on the parameter choice, is left for future work. In our work, by contrast, we investigate our incentive model and contribute a detailed perspective of incentives resulting from using such data-sharing platforms. Some other incentive mechanisms have also been proposed [43,44] to motivate user participation in data sharing. Reputation-based approaches [28,29,45] have also been proposed for settings where service providers and requesters are not assumed to be trusted. Service requesters use reputation-based credentials, which reflect a perception of the service provider's past behavior, to choose service providers. Privacy-preserving incentive mechanisms [17,46] have been proposed as well, such as ReportCoin [46], which motivates users to publish anonymous reports and pays incentives in Rcoins.
However, in this paper, we only consider and simulate monetary compensation. Some other approaches [47][48][49] include incentive mechanisms for data sharing in IoT and clouds. The authors' approaches include the Shapley value, which is commonly used for resource sharing and revenue distribution models. However, the authors also raise the challenge of achieving a fair distribution of benefits. In our future work, we will test the application in a closed environment with real participants to understand their behavior towards the system and how incentives contribute to it.

Conclusion and future work
In this paper, we present incentive mechanisms for blockchain-based data-sharing platforms. We propose multiple smart contracts that dynamically adjust incentives and participation costs. Using multiple cost pricing scenarios for data owners, we simulate data monetization strategies. We conclude that a cost compensation incentive model rapidly covers the cost of data sharing, thus encouraging data owners to share data in the platform. In the future, we will study end-user interactions to better understand other forms of incentives, such as knowledge sharing, and how they impact the dynamics in a data-sharing network. We will also further explore other monetization strategies and generate more sophisticated simulations that derive participation probabilities from pricing strategies.